Abstract

This paper examines some challenges to the validity of existing multiple-choice critical thinking tests 
and proposes how the validity of such tests might be put on a sounder footing. Several plausible 
hypotheses are proposed for explaining variance on critical thinking tests other than the hypothesis that 
the variance is due to differences in critical thinking. There is no evidence to support or rule out these 
alternative explanations. It is argued that asking samples of subjects to provide verbal reports of their
thinking while working on such tests is a way to gather the needed evidence. The argument is
supported by a study which showed that the thinking revealed in subjects' verbal reports while taking a
test is likely an accurate representation of the thinking which they would have followed had they taken
the test in its normal paper-and-pencil format.
INFORMAL REASONING ASSESSMENT: USING VERBAL REPORTS OF
THINKING TO IMPROVE MULTIPLE-CHOICE TEST VALIDITY

A commonly understood characteristic of informal reasoning is that it can lead to multiple solutions for 
problems through multiple reasoning approaches. To accept such a state of affairs one does not have 
to be a complete anarchist, in the sense of being prepared to accept any solutions and any reasoning 
approaches. Restrictions can be made on the range of solutions and approaches that still leave room 
for considerable diversity.

Nevertheless, the possibility of diverse outcomes reached through diverse approaches creates problems
for informal reasoning assessment. The problems are particularly acute when assessments involve the
use of multiple-choice tests, since such tests reveal examinees' choices of answers but not the reasoning
which led to those choices. If answers different from those keyed correct can be justified, then it is
difficult to infer from examinees' answers alone the quality of their reasoning. If an examinee chooses
the keyed answer, how proper is it to infer that some acceptable reasoning process was followed?
Alternatively, if an examinee chooses an unkeyed response, how sound is the inference that an
unacceptable reasoning process was followed?

Despite their shortcomings for informal reasoning assessment, multiple-choice tests are popular and
likely to remain so. They are one major factor controlling instruction at the elementary and secondary
school levels and are, indeed, one of the best means available for assessing some aspects of informal
reasoning competence (Tomko & Ennis, 1980). This is not to say that multiple-choice tests can be
used for all purposes. Essay tests, interviewing individual students, and direct classroom observation
can serve purposes and yield information which multiple-choice tests cannot. For instance, all three
seem better suited than multiple-choice tests to assessing informal reasoning dispositions (Norris &
Ennis, in press). But using multiple-choice tests is probably the best way to develop student profiles on
the many specific abilities which comprise informal reasoning, such as the ability to use the many
criteria which are needed for judging the credibility of sources.

We are thus torn by two facts: (a) informal reasoning competence generally refers to the ability to use
sound reasoning processes, rather than to the provision of adequate answers to tasks; and (b)
multiple-choice tests, which provide no direct evidence on the reasoning processes used to accomplish tasks, are
a popular and important approach for assessing informal reasoning competence. A question is whether
existing multiple-choice tests of informal reasoning can adequately support inferences about the quality
of reasoning processes and, if not, whether test construction practices can be improved so that future
multiple-choice tests will be more valid.

This paper begins by challenging the validity of existing multiple-choice tests of informal reasoning.
The point is made that the methodologies used to design these tests generally provide no direct
evidence to counter the challenges. The second section proposes that eliciting verbal reports of
thinking from examinees on trial test items is a way to obtain the direct evidence required. Research
on the use of verbal reports in testing is sparse and provides little clear guidance on the usefulness of
verbal reports for multiple-choice test validation. Some relevant research on verbal reporting from
outside of testing is described, but there are still unresolved issues concerning the use of verbal reports
of thinking for test validation. The third section reports a study designed to test the relevance of the
evidence in verbal reports of thinking for validating multiple-choice tests of informal reasoning. The
results suggest strongly that the evidence is relevant. Several implications for informal reasoning
assessment are discussed in the final section.

Problems with Multiple-Choice Informal Reasoning Tests 

When using multiple-choice tests of informal reasoning it is necessary to infer from the answers 
selected by examinees the reasoning processes they followed in reaching those answers. Several factors
can render such inferences untrustworthy. The operation of four of these factors will be exemplified:
the degree of informal reasoning sophistication of examinees; the background empirical beliefs of
examinees; the assumptions which examinees bring to test items; and the political and religious
ideologies held by examinees. The four factors overlap conceptually and empirically, but it is useful to
distinguish them in this discussion in order to highlight different aspects of the overall problem of
validating multiple-choice informal reasoning tests.

Degree of Sophistication of Examinees 

Multiple-choice test items typically allow for only one correct answer. This restriction can create
problems when testing reasoners with different degrees of sophistication in informal reasoning. By
"different degrees of sophistication" I do not mean merely "different competence." A Grand Master is
so much better than I at chess that comparisons of our competence are almost meaningless because we
are in entirely different reference groups. It is this sort of difference that I am attempting to portray
here, because the advertised audience for many multiple-choice informal reasoning tests is so broad as
to make one wonder whether entirely different reference groups are being considered.

Let us examine an item from Section I of the Cornell Critical Thinking Test Level X (Ennis &
Millman, 1985a), a popular multiple-choice test which assesses several aspects of informal reasoning
competence. The test is aimed primarily at high school and undergraduate college students, but is
recommended for use as low as fourth grade. Items are cast in the context of a story of a team of
explorers that has just arrived on the newly discovered planet Nicoma. The explorers are searching for
other explorers who arrived on Nicoma two years previously, but who have not been contacted since.
Each item in Section I presents some information discovered by members of the second team and
examinees are to decide whether the information is evidence for, evidence against, or neither evidence
for nor against the hypothesis that all the members of the first team are dead. Here is the first item:

1. You go into the first hut. Everything is covered by a thick layer of dust. 

The keyed answer is that the item presents evidence for the hypothesis that the members of the first
group are all dead. However, judgments of the direction of evidence can vary legitimately with the
informal reasoning sophistication of examinees. Suppose, reasoning in the following manner, an
examinee concluded that the information in Item 1 was evidence neither for nor against the hypothesis
that all the members of the first team are dead:

I conclude that the information in Item 1 is evidence neither for nor against the
hypothesis that all the members of the first team are dead. There are just too many
ways to explain the information and we do not have sufficient information to choose
among the possibilities. Maybe the first team stopped using this hut. Maybe they are
using the hut for activity that raises a lot of dust. Maybe they have moved to another
place on Nicoma. Maybe in fact they are all dead. Given that all of these possibilities
can explain the information and given that there is insufficient information to choose
among the possibilities, my theory of evidence says that the information is evidence
neither for nor against any of the possibilities, including the hypothesis that all the
members of the first team are dead.

There may be reason to disagree with the reasoning of this examinee. However, it is unlikely that the 
reasoning could be considered bad. In fact, the person's reasoning is quite sophisticated and it is this 
very sophistication which led to choosing an answer for Item 1 other than the one keyed correct.
However, concurring with the key and marking the examinee's answer incorrect would not do justice to
the level of the person's thinking. In a paper-and-pencil sitting where choice of answer is all that is
recorded, this is exactly what would happen.
The same point can be illustrated using an item from the Cornell Critical Thinking Test Level Z (Ennis
& Millman, 1985b), a test aimed at more sophisticated reasoners than Level X. The item is in Section
II of the test and portrays two people debating whether or not the drinking water of Gallton ought to
be chlorinated. Some thinking in the debate is faulty and, for each item, examinees are to choose from
a list the best reason why the thinking is faulty. Here is the item:

11. DOBERT: I hear that you and some other crackpots are trying to
get Gallton to chlorinate its water supply. You seem 
to think that this will do some good. There can be no 
doubt that either we should chlorinate or we shouldn't. 
Only a fool would be in favor of chlorinating the water, 
so we ought not to do it. 

ALGAN: You are correct at least in saying that we are trying to 
get the water chlorinated. 

Pick the one best reason why some of this thinking is faulty. 

A. Dobert is mistakenly assuming that there are only two 
alternatives. 

B. Dobert is using a word in two ways.

C. Dobert is using emotional language that doesn't help 
to make his argument reasonable. 

Alternative A appears to be true, since there are many alternatives that range from not chlorinating at
all to chlorinating using different concentrations of chlorine. Alternative B does not seem to be true.
Alternative C, however, also appears true. There is thus a problem of deciding whether A or C offers
the best reason for saying some of Dobert's thinking is faulty. The keyed answer is C on the grounds
that, compared to the objection in C, it is insignificant to object that there are more than the two
alternatives Dobert considers. However, a sophisticated informal reasoner might choose A on the
grounds that it is Dobert's misunderstanding of chlorination which leads to his emotional outcry. The
person might reason that if Dobert had an understanding that chlorination can occur in different
degrees, then Dobert might have concluded that some level of chlorination is tolerable and not have
become emotional. A sophisticated reasoner is more likely to see how people's beliefs, even about
technical matters such as levels of chlorination, can affect their emotional responses. But this very
sophistication can lead to being marked wrong on paper-and-pencil multiple-choice tests.

Problems can arise in other ways because of the different degrees of sophistication of examinees.
Some items used to test for informal reasoning ask examinees to choose a level of endorsement for
conclusions. However, examinees with different degrees of sophistication can justifiably choose
different levels of endorsement, leading again to the possibility of examinees choosing unkeyed
answers, even though they reasoned well. An example of such an item is found in the Watson-Glaser
Critical Thinking Appraisal (Watson & Glaser, 1980), a test designed primarily for the junior high
school level on up. In the item, examinees are to read a passage and assume that what it says is true.
They then read a statement and judge, based on the information in the passage, whether it is True,
Probably True, Probably False, or False, or that there is Insufficient Data to decide. The analysis
which follows is based upon analyses in Ennis and Norris (in press) and Norris and Ennis (in press).
Mr. Brown, who lives in the town of Salem, was brought before the Salem municipal
court for the sixth time in the past month on a charge of keeping his pool hall open 
after 1 a.m. He again admitted his guilt and was fined the maximum, $500, as in each 
earlier instance. 

6. On some nights it was to Mr. Brown's advantage to keep his pool 
hall open after 1 a.m., even at the risk of paying a $500 fine. 

The answer keyed correct is Probably True, which is said in the test manual to mean that it is more
likely to be true than false that on some nights it was to Mr. Brown's advantage to keep his pool hall
open after 1 a.m. However, a sophisticated informal reasoner might be able to imagine several
alternative explanations of the facts. Mr. Brown might not have kept the pool hall open, but his son,
whom Mr. Brown had recently put in charge of the business, kept it open. Mr. Brown was willing to
take the blame and pay the fines for his son's offenses because he felt guilty for having neglected his
son for many years. Maybe Mr. Brown had not kept the pool hall open, but had admitted he did in
order that the fine could fall into the hands of corrupt municipal authorities as payment for giving him
a license. Perhaps Mr. Brown had suffered a severe personal shock that resulted in his doing things
which were not to his advantage. Perhaps Mr. Brown was protesting the discriminatory laws of his
town which allowed some businesses to remain open later than 1 a.m., even though there were no
principled reasons for doing this. He was protesting on principle, not because he thought the protest
would be to his advantage. A sophisticated informal reasoner could conceive of possibilities such as
these and, if a number of possibilities occur to a person when there is not enough information to
adjudicate among them, then the person can justifiably choose Insufficient Data.

As another possibility, imagine a less sophisticated person who had learned that business people often
break the law to their advantage, if the fines are small enough. Suppose the person also believes that a
fine of $500 is sufficiently large that the only explanation of a business person's repeatedly acting so as
to be levied such a fine is that the action is to the person's advantage. This examinee might justifiably
choose True. Either way, examinees reasoning justifiably according to their level of sophistication
would be marked wrong on paper-and-pencil sittings.

Background Empirical Beliefs of Examinees 

Examinees bring different background beliefs to bear on multiple-choice informal reasoning tasks.
The effect of such differences can be illustrated using a question from Section II of the Cornell Critical
Thinking Test Level X. Recall that a team of explorers has landed on Nicoma to search for a team that
has not been contacted in two years. The second team is exploring the area around their landing site
and has found some water. In the item, the task is to choose which, if either, of two underlined
statements is more believable.

27. A. The health officer says, "This water is safe to drink."

B. Several others are soldiers. One of them says, "This water is not safe."

C. A and B are equally believable. 

The answer keyed correct is that the health officer's statement is more believable, because the health
officer should be more expert than the soldier on the potability of water and because experts speaking
in their own fields tend to be more believable than nonexperts. Suppose, however, an examinee
believes that the training of soldiers and the equipment they carry equips them to make as dependable
tests of water safety as health officers. Such an examinee would choose C as the answer, because the
health officer and soldier are equally expert, but would be wrong according to the answer key.
However, the examinee would have known that expertise in a field tends to make one more credible
and would have used that criterion for choosing C. This is precisely the informal reasoning competence
the item is designed to reward. But the person choosing A would be rewarded and the person choosing
C penalized, even though the difference between them would have been their background empirical
beliefs, not their informal reasoning competence.

Consider another example based on the Test on Appraising Observations (Norris & King, 1983).
Items are set in the context of a traffic accident and various witnesses and people involved in the
accident are reporting to police what they observed happening. In Item 9, Ms. Vernon and Martine
each report on a car that went through a stop sign. The task for examinees is
to judge which of the underlined reports is more credible.

9. Ms. Vernon then says, "I also remember that a fancy blue sports car went
through the stop sign."

Martine says, "A car with twin headlights went right through the stop sign."

This item is designed to test the Principle of Observational Salience: Observations of more salient
features of events tend to be more believable than observations of less salient features. Features of an
event are salient to the extent that they are extraordinary, colorful, novel, unusual, and interesting and
not salient to the extent that they are routine, commonplace, and insignificant. The answer keyed as
correct, based on the empirical belief that being a fancy blue sports car is more salient than having twin
headlights, is that there is more reason to believe Vernon's statement.

An examinee reasoning as follows would use the Principle of Observational Salience, but would not
choose the answer keyed as correct.

A fancy blue sports car is something which would stand out, but blue is not as
noticeable a colour as red and there are a lot of fancy blue sports cars around
nowadays. Twin headlights aren't as popular as they were in the past when just about
every car had them, so they would stand out too. I believe neither would stand out
more than the other, so the statements are equally believable.

This examinee knew the principle of informal reasoning being tested, but would have been marked
wrong because of his or her empirical belief that having twin headlights is as salient a feature these
days as being a fancy blue sports car.

Assumptions Made by Examinees 

Examinees can make different assumptions while working on the same multiple-choice informal
reasoning items. Moreover, there are different assumptions that can lead justifiably to different choices
of answers. Consider the following items from the Interpretation subtest of the Watson-Glaser test.
The task is to judge whether or not the statements follow beyond reasonable doubt from
the information given in the paragraph.

Pat had poor posture, had very few friends, was ill at ease in company, and in general
was very unhappy. Then a close friend recommended that Pat visit Dr. Baldwin, a
reputed expert on helping people improve their personalities. Pat took this
recommendation and, after three months of treatment by Dr. Baldwin, developed
more friendships, was more at ease, and in general felt happier.

55. Without Dr. Baldwin's treatment, Pat would not have improved.

56. Improvements in Pat's life occurred after Dr. Baldwin's treatment started.

57. Without a friend's advice, Pat would not have heard of Dr. Baldwin.

The keyed answers are that the statement in Item 56 follows beyond reasonable doubt from the
information in the paragraph and that the other two statements do not follow beyond reasonable
doubt. In fact, the statement in Item 56 follows beyond all doubt, because the information includes the
fact that the improvements occurred after three months of treatment by Dr. Baldwin. This indicates a
problem, because it seems the standards for being beyond reasonable doubt are
taken by the test developers to be the same as those for being beyond all doubt.

Suppose an examinee takes "beyond reasonable doubt" in its everyday sense and
ponders Item 55 as follows, making the assumptions indicated:

The statement is ambiguous between "would not have improved ever" or "would not
have improved during the three month period." It is obvious that there is insufficient
information to say beyond reasonable doubt that Pat would never have improved
without the help of Dr. Baldwin, so the statement must mean "would not have
improved during the three month period." But is it beyond reasonable doubt that he
would not have improved during this three month period had he not received Dr.
Baldwin's treatment? Well, from the description, I assume that Pat had been
suffering in this way for a long time. Problems such as this typically do not occur
overnight, nor typically go away quickly by themselves without professional help. I
therefore assume that Pat's problem was not one that would have gone away quickly
on its own. Given these assumptions, the most plausible explanation of Pat's
improved condition is that it was brought about by the treatment and therefore, while
I cannot be certain, it seems beyond reasonable doubt that without Dr. Baldwin's
treatment there would not have been such an improvement during the three months.

Such an examinee would be reasoning well, but would choose an answer other than the one keyed
correct and be penalized for that in a paper-and-pencil sitting. The person made justifiable
assumptions that differed from those of the test developers and, as a result, would receive no credit
in a paper-and-pencil sitting.

Ideologies of Examinees 

Conceptions of informal reasoning competence do not incorporate or presuppose any political or
religious ideology. Being subject to reason might be considered an ideology but, if so, it is not a
political or religious one. However, political ideology can influence choices of answers on some
informal reasoning items. Consider, for example, Items 65 and 67 from the Watson-Glaser test.
Examinees are presented with a question asking whether a strong labor party would promote the
general welfare, followed by answers and reasons defending those answers:

65. No; a strong labour party would make it unattractive for private investors to risk
their money in business ventures, thus causing sustained large scale
unemployment.

67. No; labour unions have called strikes in a number of important industries.

Examinees are to assume the reasons are true and to decide whether they make strong or weak
arguments for the answers given. They are told that strong arguments are those which are both
important and directly related to the question.

Item 65 is keyed as giving a strong argument. However, for a laissez-faire examinee, the prospect of
sustained large-scale unemployment might not be important compared to the interference required to
prevent it. So, although the argument in Item 65 may be directly related to the question, it is
considered unimportant by the examinee and is, therefore, judged to be weak. On the other hand, a
social activist examinee might also mark Item 65 as weak, but for different reasons. The person might,
for example, believe that sustained large-scale unemployment would be a good thing because it would
arouse the general public to revolt against the existing economic system. For this person, the reasons
given in the item would not support the "No" answer to the question.

Item 67 is keyed as giving a weak argument. However, a political conservative might consider the
argument both important and directly related to the question and, therefore, mark the item as strong.
The conservative might believe that a strong labour party would encourage unions, which would lead to
strikes in important industries, and believe that such strikes would be detrimental to the general
welfare. Given these beliefs the person could, while reasoning well, choose an answer other than the
one keyed correct.

Section Summary and Conclusion

Multiple-choice tests of informal reasoning record only examinees' choices of answers to tasks, even
though it is the reasoning which led to the choices and not the choices themselves that is of greatest
interest. There is no direct evidence for the reasoning followed, so it must be inferred from the choices
made. Four differences among examinees can make such inferences untrustworthy: different
levels of informal reasoning sophistication; different background empirical beliefs; different
assumptions made while taking tests; and different political and religious ideologies. This section
used items from existing multiple-choice informal reasoning tests to illustrate how each of these
differences can lead to incorrect inferences about examinees' informal reasoning competence.

The items used to make these illustrations are not anomalies. They are indicative of a widespread
problem in multiple-choice tests of informal reasoning. First, it is plausible that examinees with
different levels of sophistication, different background empirical beliefs, and so on
think differently about items. Second, there is no direct evidence (one way or another) of the extent to
which such differences affect the trustworthiness of the inferences about examinees' reasoning.

Given these challenges to the validity of multiple-choice informal reasoning tests, it is important to know
whether anything can be done to increase their validity. A multiple-choice test of informal reasoning
is valid to the extent that good reasoning leads to answers keyed correct and poor
reasoning leads to answers keyed incorrect. To establish this, it is likely that evidence is needed
on the relationship between the answers examinees choose and their reasoning. One plausible way to
collect such evidence is to ask examinees to think aloud while working on trial items. Evidence
gathered in this way has been espoused often but used rarely in validating multiple-choice informal
reasoning tests. But a test founded on such evidence could resist strongly the criticisms posed in this
section. Therefore, I shall turn to an exploration of the usefulness of verbal reports of thinking for
improving multiple-choice informal reasoning tests.

Using Verbal Reports of Thinking to Validate Tests

Verbal reports of thinking contain information on the knowledge, strategies, and principles of
reasoning which lead to examinees' choices of answers. They are not a means of directly
observing reasoning, but such reports enable more trustworthy inferences about reasoning than just
an examination of the answers chosen.

Using verbal reports of thinking goes hand in hand with the construction of theories of human mental
abilities, by providing direct evidence for hypotheses about reasoning processes. The construct
validation of tests is tied to theory construction (Cronbach, 1971), so it is not surprising
that verbal reports of examinees' thinking are relevant to construct validation (Haney & Scott,
1987). If part of construct validation is the identification of the mental processes which underlie task
performance, as in the conception of construct representation, then the
relevance of verbal reports of thinking to construct validation can be more readily seen. A
multiple-choice informal reasoning test would have construct validity to the extent that good performance,
defined in terms of the number of items answered correctly, could be explained by examinees'
following sound thinking and poor performance could be explained by unsound thinking. Verbal
reports of examinees' thinking while answering test questions can thus provide direct evidence for the
construct validity of a test.

For verbal reports of thinking to be useful in the validation of an informal reasoning test, there must be
a systematic procedure for collecting the reports, extracting information from them, and using that
information for judging the quality of the test. More specifically, there needs to be a way to elicit
verbal reports of examinees' reasoning that interferes with their reasoning as little as possible. There
must be a means to use the information in the reports to judge examinees' reasoning independently of
their answers to the test items, while being sensitive to different levels of sophistication of informal
reasoning, different background beliefs, different assumptions, and different political and religious
ideologies. Finally, there must be a way to compare answers chosen to the quality of reasoning and to
determine the extent to which good and poor reasoning lead, respectively, to answers keyed correct and
answers keyed incorrect.
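
The last of these requirements, comparing answers chosen to the quality of reasoning, can be made concrete with a small tabulation. What follows is only a minimal sketch in Python under stated assumptions, not a procedure described in this paper: the two-level rating ("good" or "poor"), the function names, and the sample records are all hypothetical and introduced purely for illustration.

```python
from collections import Counter


def cross_tabulate(records):
    """Count how often each (reasoning quality, keyed answer chosen) pair occurs.

    Each record is a pair (quality, chose_keyed), where quality is "good" or
    "poor" (a hypothetical rating made from the verbal report, independently
    of the answer chosen) and chose_keyed is True or False.
    """
    return Counter(records)


def agreement_rate(records):
    """Fraction of responses in which good reasoning led to the keyed answer
    or poor reasoning led to an unkeyed answer."""
    if not records:
        raise ValueError("no records to summarize")
    consistent = sum(
        1 for quality, chose_keyed in records
        if (quality == "good") == chose_keyed
    )
    return consistent / len(records)


# Made-up responses to a single trial item; these are not data from any study.
records = [
    ("good", True), ("good", True), ("good", False),
    ("poor", False), ("poor", True), ("good", True),
]

table = cross_tabulate(records)
rate = agreement_rate(records)

print("good reasoning, unkeyed answer:", table[("good", False)])
print("poor reasoning, keyed answer:", table[("poor", True)])
print("agreement rate:", round(rate, 2))
```

On this reading, the off-diagonal cells (good reasoning with an unkeyed answer, poor reasoning with the keyed answer) are the cases that would call an item into question, and a low agreement rate would suggest that answers to the item are a poor guide to reasoning.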

There are several ways to elicit from examinees verbal reports of their thinking on test items. They
might be asked simply to say everything that comes to their minds as they work on a task.
Alternatively, they might be asked to justify their answers. Examinees might be probed with questions
about the specifics of their reasoning, by being asked whether such-and-such had anything to do with
their thinking and, if so, what role it played. Finally, some combination of these approaches might be
used.
Whatever the specifics, it is not clear whether different elicitation approaches yield more or less the
same information on examinees' reasoning, or whether any approach yields trustworthy information on
thinking. But for a test validation methodology to rely on verbal reports of thinking, these issues must
be clarified. It is not sufficient to argue, as I have done so far, that in principle verbal reports of
thinking ought to be relevant to the validation of multiple-choice tests of informal reasoning.

In reality, verbal reports of thinking are relevant to the validation of multiple-choice informal reasoning
tests only if the information on examinees' thinking which the reports contain is an accurate reflection
of the thinking which would have taken place had the examinees taken the test in normal
paper-and-pencil format. Verbal reports of thinking require that subjects provide introspective reports on the
progress of their thinking or the reasons for their performance, often in the presence of an investigator.
It is not known how such requirements influence thinking, and the small number of testing studies
which have used verbal reports of thinking (Connolly & Wantman, 1964;
Kropp, 1956; McGuire, 1963; Schuman, 1966) have ignored the question. There is some relevant
research from non-testing contexts, such as information processing research on the use of verbal
reports as data and memory research on eyewitness testimony. I will thus briefly review the research in
each of these areas.

Verbal Reports as Data 

Research on the trustworthiness of verbal reports of mental processes points to conflicting conclusions.
On the one hand, Nisbett and Wilson (1977) conclude that people have little or no introspective access
to what stimulates their cognitive processes. On the other hand, Ericsson and Simon (1980,
1984) and Smith and Miller (1978) claim that people do have dependable access to their mental
processes in certain situations.

To support their conclusion Nisbett and Wilson reviewed evidence from the cognitive dissonance,
self-perception attribution, learning without awareness, and problem-solving literatures. Based upon this
evidence, they conclude three things: (a) people often cannot accurately report the effects of certain
stimuli on their responses to problems requiring higher order thinking; (b) when people do report on
such stimuli they often do not search their memories to discover what the stimuli were, but rather
appeal to plausible hypothetical mechanisms which they accept a priori; and (c) when people are
correct about the stimuli affecting their responses they have coincidentally appealed to a hypothesis
which happens to be correct. Nisbett and Wilson argue that such coincidences occur when the actual
causal stimulus is available to memory because, a priori, it is the most plausible cause of the response.

Smith and Miller take issue with these conclusions, because the experimental situations upon which the
conclusions are based do not support the generalizations made in them. In particular, experiments are
situations in which the influential stimulus is "systematically and effectively [hidden] from [subjects] by
[the] experimental designs" (1978, p. 356). The influential stimulus can only be ascertained by
comparing the treatment and control groups and, of course, subjects in an experiment cannot do this.
Therefore, Smith and Miller argue that Nisbett and Wilson's conclusions apply only to experimentally
controlled situations in which subjects' unawareness of what is influencing their thinking is a natural
consequence of the experimental setup. They claim that these experimental findings are not
generalizable to people's attempts to report on their mental processes outside of experimental settings.
Reports of thinking on test items might thus fall outside the scope of Nisbett and Wilson's
conclusions, because testing does not usually attempt to hide influential stimuli from examinees.

Ericsson and Simon (1980, 1984) discuss the trustworthiness of verbal reports of thinking from an
information processing perspective. They conclude that instructions to verbalize slow down, but do not
change the course of, cognitive processing when subjects are verbalizing information that would
normally be available to them in short-term memory. Specific and directive probes alter cognitive
processing, however, as do requests to supply motives and reasons. This conclusion is particularly
relevant for test validation contexts where reasons for answers might be sought. The conclusion
suggests that some verbal reports of thinking on test items might not be applicable to testing contexts in
which verbal reporting is not done.

Ericsson and Simon make specific hypotheses about how different types of requests to think aloud can
affect the trustworthiness of verbal reports of thinking. In particular, they hypothesize that the less
leading the probe employed the more accurate the information obtained, and that more information
with an overall lower trustworthiness can be obtained with more leading probes. These hypotheses
need to be tested.

It is not legitimate to assume that the research on verbal reports as data answers all the questions
relevant to the use of verbal reports of thinking in testing situations. Testing contexts are sufficiently
different from experimental and information processing research contexts that it is reasonable to
expect that memory retrieval and information processing demands might also differ. In particular,
test-takers make specific assumptions about how they should try to perform and how the results reflect
upon them that are probably different from those made when involved in a psychological study.

Eyewitness Testimony Research

Eyewitness testimony is often contained in verbal reports given in response to questions. Verbal
reports of thinking on tests are similar sorts of things. In one situation, people try to remember what
they have observed; in the other, they try to remember what they have thought. The remembering
processes are likely related, though not identical. Thus, research on the factors which affect the
accuracy of eyewitness testimony is pertinent to the question of the accuracy of verbal reports of
thinking on tests. The degree of pertinence is tempered by dissimilarities between the two contexts: in
one, the memory is of an external event, whereas in the other it is of an internal event; in one, the
memory is of events in the more distant past, whereas in the other the memory is of events in the very
recent past.

The eyewitness testimony research most relevant to the present study explores the effect of different
types of questioning on the accuracy of observation reports. Three categories of questions have been
identified (Loftus, 1979, p. 90): (a) those eliciting free reports (e.g., "Tell us all that you saw"); (b) those
eliciting controlled reports (e.g., "Give us a description of what your assailant was wearing"); and (c)
those eliciting alternate-choice reports (e.g., "Did your attacker have dark or light hair?"). Two general
conclusions can be drawn on the basis of many independent studies of these types of questioning
(e.g., Harris, 1973; Hilgard & Loftus, 1979; Lipton, 1977; Loftus & Palmer, 1974; Marquis, Marshall,
& Oskamp, 1972). The first conclusion
is that free reports tend to be more accurate than any other type of report; controlled reports rank next
in accuracy; and alternate-choice reports have the lowest degree of accuracy. The second conclusion is
that the amount of information obtained increases in the opposite direction: free reports contain the
least amount of information; controlled reports contain somewhat more information; and
alternate-choice reports contain the most information of all. So then, free reports give a relatively lesser amount
of relatively more accurate information, and alternate-choice reports a relatively greater amount of
relatively less accurate information. These results are consistent with the hypotheses offered by Ericsson
and Simon (1980, 1984).

As with the research on verbal reports as data, it is not legitimate to assume that the results of
eyewitness testimony research can be applied directly to testing. Eliciting reports of thinking on tests
is different from eliciting recollections of observed events, and there is no research which explores how
these differences affect the accuracy of both types of report.

An Unresolved Problem 

Let us retrace. The evaluation of informal reasoning competence makes demands which traditional
multiple-choice tests are not equipped to meet. Problems requiring informal reasoning for their
solution often admit of more than one solution, but multiple-choice tests usually have only one correct
answer. Evaluators of informal reasoning are usually more interested in the process of examinees'
reasoning than the product, but multiple-choice tests typically give no direct evidence on reasoning
processes.

Despite these problems, multiple-choice tests are likely to continue to be used and to have considerable
influence. Therefore, it would be worthwhile to have a way to validate the tests which can provide
some direct evidence on the reasoning processes they elicit. One natural way to gain direct evidence on
reasoning is to ask people to think aloud. Applied to the validation of multiple-choice informal
reasoning tests, tests could be examined by asking samples of examinees to work on them and to report
verbally on their thinking. Judgments could be made of whether or not good and poor informal
reasoning led, respectively, to keyed and unkeyed answers. Specifically, the evidence could indicate
whether differences in performance across an intended audience for the test were significantly affected
by such factors as differences in reasoning sophistication, background empirical beliefs, assumptions
made, and religious or political ideologies.

The idea is sound in the abstract. But there is still much to learn about how thinking aloud affects
thinking itself. More particularly, there is virtually no research on the use of verbal reports of thinking
in testing contexts, and the verbal reports as data and eyewitness testimony literatures can be taken
only as suggestive of what to expect in testing. The use of verbal reports of thinking to validate tests
would be justified only if their elicitation does not alter significantly the course of examinees' thinking
from what it would have been had they worked on the tests in paper-and-pencil format. If a significant
alteration occurs, then information on the validity of tests derived from the verbal reports of thinking
would not provide evidence on the validity of the tests for the paper-and-pencil sittings for which most
are intended. It is therefore worth exploring whether verbal reports of thinking on multiple-choice
informal reasoning tests provide relevant evidence on the validity of those tests.

Relevance of the Evidence in Verbal Reports of Thinking

The issue of the relevance of verbal reports of thinking to validating multiple-choice tests of informal
reasoning was studied by trying to answer two research questions:

1. Do different ways of requesting verbal reports from examinees yield different
information on their thinking?

2. Does the act of verbally reporting thinking alter examinees' test performance?

The first question pertains to the role of the interview procedure. As stated earlier, slight changes in
the wording of interrogations of eyewitnesses can cause different accounts of events to be given. Is the
same true when asking examinees to verbally report their thinking? The second question addresses the
issue of how verbally reporting one's thinking alters the course of that thinking. If significant
alterations occur, then there should be differences in test performances between examinees who
verbally report their thinking and those who do not.

Description of the Study 

To help answer these questions, 343 senior high school students from four high schools participated in
an experiment. Verbal reports of their thinking were elicited as they worked through Part A of the
Test on Appraising Observations (Norris & King, 1983). As described previously, it is a
multiple-choice test focussed on one aspect of informal reasoning competence, the ability to judge the credibility
of reports of observations. In Part A, items are cast in the context of a traffic accident. In each item,
two people, who may or may not have been involved in the accident, provide accounts of what happened.
Examinees are to judge which, if either, of the accounts is more credible. Judgments should be based
on characteristics of either the observers, the observation conditions, or the statement of observation.

A completely randomized factorial design was used. Students were randomly assigned to one of five
groups:

1. No Probe Group: Students were not interviewed and worked alone on the test in a
   paper-and-pencil format.

2. Think Aloud Group: Students were instructed to report all they were thinking as they
   worked through the items.

3. Immediate Recall Group: Students were asked to choose their answers to each question and to
   justify their choices immediately after each was made.

4. Criteria Probe Group: While working on each question, students' attention was drawn to a
   particular piece of information in it. They were asked whether that
   information made any difference to the answers they chose and, if
   so, to explain the difference.

5. Principle Probe Group: Students were treated as in the criteria probe group, except they
   were asked an additional question aimed at determining whether
   their choices were based upon particular general principles.

The no probe group simulated conditions under which the test would normally be given. Students
worked alone at a desk and marked their answers on an answer sheet. In the think aloud group there
was little constraint on students' responses, since only the general
instruction to think aloud was given. In subsequent groups, students' responses were constrained by
leading requests for particular sorts of information. The types of probes vary analogously in degree of
leadingness to those studied in eyewitness testimony research. If the results of that research generalize
to testing situations, then students' verbal reports of thinking should vary depending upon their probing
group.

Let us consider how the procedure would work for each of the groups working on a given item. Here is
Item 3 as an example:

A policewoman has been asking Mr. Wang and Ms. Vernon questions. She asks Mr.
Wang, who was one of the people involved in the accident, whether he had used his
signal.

Mr. Wang answers, "Yes, I did use my signal."

Ms. Vernon had been driving a car which was not involved in the accident. She tells
the officer, "Mr. Wang did not use his signal. But this didn't cause the accident."

Students were to choose which, if either, of the underlined statements is more credible. In addition,
the following instructions were given to students in each interviewed group:

Interviewed Group      Instructions to Examinees

Think Aloud            Try to tell me all that comes to your mind as you think about this
                       question.

Immediate Recall       Which answer do you choose? ... Can you tell me why you chose
                       that answer?

Criteria Probe         Which answer do you choose? ... Did the fact that Mr. Wang was
                       involved in the accident affect your choice?

Principle Probe        Which answer do you choose? ... Did the fact that Mr. Wang was
                       involved in the accident affect your choice? ... (If "No") Go on to
                       the next item. (If "Yes") What difference did it make to your
                       thinking that he was involved?

Students' verbal reports were tape recorded and transcribed verbatim. All students were assigned
Performance Scores according to the key provided
with the test (Norris & King, 1985). Students who had given verbal reports were also assigned Thinking
Scores. These scores indicated the quality of thinking displayed in the verbal reports on a scale of 0-3
for each item. They were assigned independently of answers chosen.

Quality of thinking was judged by comparing students' verbal reports to ideal models of thinking
developed for each item. The models were based upon a set of principles for assessing the credibility
of observations, knowledge of which the test was designed to measure. Staying with Item 3, the ideal
model was based as follows on the principle that people in a conflict of interest tend to be less credible
than those not in a conflict of interest:

Mr. Wang was involved in the accident, but Ms. Vernon was not involved. Mr. Wang
is less credible because his involvement would give him reason to say he used his
signal even if he did not. Wang is in a conflict of interest. People in a conflict of
interest, that is, people who have something to gain by events being cast as they
described them, tend to be less credible than those who are not in such a situation.

According to the model, an examinee first needs to identify in the text the relevant information about
Wang's and Vernon's involvement. The text is simple enough that reading ability should not impede
this identification for most high school students. Second, an examinee must remember from
experience that not using a turn signal can cause an accident and that being held responsible for an
accident can be troublesome. High school students should have ready access to such common
knowledge. Finally, an examinee has to recognize that being in a conflict of interest is an
accuracy-reducing factor and apply this principle to make a credibility judgement.

So for Item 3, thinking scores were assigned according to the following scale:

1 point:   The examinee points out that Mr. Wang was involved in the accident.

2 points:  The examinee points out that Mr. Wang was involved in the accident
           and compares Mr. Wang's involvement to Ms. Vernon's being a
           bystander.

3 points:  The examinee points out that Mr. Wang was involved in the
           accident, compares this with Ms. Vernon's non-involvement, and
           shows that this is an instance of a more general phenomenon in
           which people stand to profit or lose depending upon what they say.

0 points:  The examinee does none of the above or does not respond.

In general, a thinking score of 3 on an item required doing each of the following:

1. citing the relevant facts in the text which can be used to compare the underlined
statements for their credibility;

2. using these facts together with any relevant background knowledge to make a
comparative evaluation of the credibility of the statements;

3. showing how the evaluation is based on a generalized accuracy-reducing factor.
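
The three steps listed above build on one another, which can be sketched as a small scoring function. The Python below is an illustration only and is not part of the study: the function name `thinking_score` and its three boolean inputs are hypothetical stand-ins for a human rater's judgments, and the sketch assumes that each step earns credit only when the earlier steps are also present, as in the Item 3 scale.

```python
def thinking_score(cites_facts: bool, compares_credibility: bool, states_principle: bool) -> int:
    """Return a 0-3 thinking score for one item (hypothetical helper).

    Assumption: the levels are cumulative, so a later step earns credit
    only if every earlier step was also judged present by the rater.
    """
    score = 0
    for step_present in (cites_facts, compares_credibility, states_principle):
        if not step_present:
            break
        score += 1
    return score


# A report judged to cite the facts and compare the two speakers,
# but not to state any general principle, scores 2.
print(thinking_score(True, True, False))  # 2
```

The judgments themselves still have to be made by a rater reading the transcript; the function only records how those judgments combine into a score.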

To illustrate the procedure more clearly, let us examine a transcript of one student's verbal report of
thinking on Item 3. The student said:

The second one 'cause, ah, 'cause he'd say that he used the signal so he wouldn't have
nothing to do with the accident. Probably afraid he'd have . . . he'd be questioned by
the police or something.

Note that verbal reports of thinking tend to have many colloquialisms, repetitions, and false starts,
and these things must be overlooked when rating examinees' thinking. This
student would be assigned a thinking score of 2. There is judgement involved in this decision, because
there is not a one-to-one correspondence between what the student said and the rating scale above.
The student did recognize the accuracy-reducing role of Wang's being involved in the accident.
The student did not state explicitly the facts that Wang was involved and Vernon was not, but these were
clearly implicit in the student's thinking. The student would not be given a 3, because no general
principle was cited.

Results 

The verbal reports of thinking, the thinking scores, and the performance scores were analyzed
quantitatively and qualitatively in an attempt to answer the two questions raised at the beginning of the
previous section (for more details see Norris, 1985):

1. Do different ways of requesting verbal reports from examinees yield different
information on their thinking?

2. Does the act of verbally reporting thinking alter examinees' test performance?

The results of the quantitative analysis of thinking scores showed no statistically significant differences
across the four groups that were interviewed. So in answer to the first question, whatever other effects
the different types of probes might have had, they did not affect the quality of students' thinking as
measured by the thinking score scale.

To further examine the thinking of students in the different interview groups, a qualitative analysis was
conducted of a random sample of 40 (stratified by interview group) of the total sample of 271
interviews. A coding scheme was devised for indicating a variety of verbal moves in the examinees'
verbal reports. The moves are as follows:

Citing Factual Details - either recalling a factual detail given in an item prior to the
one currently being done, recalling such a prior detail incorrectly, or stating a detail in
the current item;

Asking Rhetorical Questions - posing questions which appear to be directed to the
examinee himself or herself rather than to the interviewer;

Making Evaluations - either evaluating judgments or conclusions which had been
previously explicitly stated, or evaluating ones which had not been verbalized;

Constructing Supporting Assumptions - either making detailed factual assumptions
specific to the current item, or making more generalized assumptions of broad
principles of appraisal or causal laws covering more than the situation in the current
item;

Using Attention Control Devices - either making comments about the stage of
progress reached in reasoning through the problem (Let's see. Where was I?), or
commenting on the direction reasoning should proceed (Wait now!);

Interacting with the Experimenter - directing comments or questions to the
experimenter;

Pausing - either making verbal inflections (Ahhh! Mmmm!) or being silent.

The 40 verbal reports of thinking were coded according to the seven categories and frequencies of
verbal moves calculated. These frequencies are given in Table 1. No systematic analysis was done on
these data, but they were examined for general trends with a view to more systematic exploration in the
future. Note that there are clear differences in frequencies of occurrence among verbal move
categories. However, there are no glaring differences in trends across the interview groups, supporting
the conclusion of the quantitative analysis that there was no difference in quality of thinking across the
four interviewed groups.
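
One simple way to look for the general trends mentioned above is to express each verbal move as a share of all coded moves within its interview group. The Python sketch below is an illustration added for the reader, not an analysis carried out in the study. It uses the frequencies reported in Table 1; the names `GROUPS`, `TABLE_1`, `column_totals`, and `shares_by_group` are hypothetical.

```python
GROUPS = ("Think Aloud", "Immediate Recall", "Criteria Probe", "Principle Probe")

# Frequencies copied from Table 1 (rows: verbal moves, columns: GROUPS).
TABLE_1 = {
    "Citing Factual Details": (104, 139, 99, 139),
    "Asking Rhetorical Questions": (16, 9, 2, 5),
    "Making Evaluations": (45, 24, 39, 43),
    "Constructing Assumptions": (178, 228, 214, 227),
    "Attention Control": (26, 25, 15, 19),
    "Interact with Experimenter": (19, 9, 12, 13),
    "Pausing": (499, 387, 424, 380),
}


def column_totals(table):
    """Total number of coded moves in each interview group."""
    return [sum(column) for column in zip(*table.values())]


def shares_by_group(table):
    """Map each move to its percentage share of all moves within each group."""
    totals = column_totals(table)
    return {
        move: [100.0 * count / total for count, total in zip(counts, totals)]
        for move, counts in table.items()
    }


# Print the group totals, then one row of percentage shares per verbal move.
print(dict(zip(GROUPS, column_totals(TABLE_1))))
for move, shares in shares_by_group(TABLE_1).items():
    print(f"{move:<30}" + "".join(f"{share:8.1f}" for share in shares))
```

Shares are used rather than raw counts because the groups differ somewhat in their total number of coded moves. The tabulation is descriptive only and does not test whether the groups differ.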

[Insert Table 1 about here.] 

Both the quantitative and qualitative results suggest strongly that it was subjects' thinking and not how
that thinking was elicited that controlled what was reported. If this conclusion can be substantiated in
other studies and for other tests, then it would seem that the accuracy of verbal reports of thinking on
multiple-choice informal reasoning tests is not as sensitive to the type of probing as research in other
contexts would indicate. That is, testing may be a context whose demands are sufficiently unique that
the use of verbal reports of thinking on tests needs study in its own right.

The second question asked whether the act of verbally reporting thinking alters examinees' test
performance. Analysis showed that there are no statistically significant differences in performance
scores between any of the interviewed groups and the group who took the test in the paper-and-pencil
format. This result suggests that probing did not alter thinking, because if the course of examinees'
thinking had been altered by giving verbal reports on their thinking, then this alteration should have
been revealed in altered performance. It is hard to imagine how their thinking could have differed in a
systematic fashion while their performances stayed precisely the same.

Discussion and Conclusions 

Whenever an experiment finds no differences between treatments, the power of the
experiment to detect differences which actually exist becomes an important concern. Was this
experiment sufficiently powerful to detect any differences which existed among the groups? There are
a number of reasons to believe that differences would have been detected had they been present in the
population. First, the treatments were considerably different from one another. It is quite different for
high school students to work alone on a test than to work in the presence of a stranger who is probing
their thinking. Thus, if eliciting verbal reports of thinking tends to alter the course of thinking, then
alterations should have been revealed in differences in performance between the interviewed and
uninterviewed groups.

Second, the interview treatments themselves were considerably different. The leading probes were
quite leading, because they made explicit suggestions to students about what could have affected their
choices of answers. It would have been an easy matter for students to conform to these suggestions by
altering their answer choices and their way of thinking about items. Instead, students denied regularly
that a suggested factor had anything to do with their thinking and proceeded to explain how their
choices were made. That is, students tended to maintain whatever interpretation made sense to them.

Third, effects were sought from a number of different directions, but were found in none of them. The
quantitative analysis showed no differences in thinking scores across the four interviewed groups and
no differences in performance scores across all five groups. The qualitative analysis showed that the
same patterns of verbal moves were used by students in each of the interviewed groups. It is plausible
to think that if the verbal reporting altered students' thinking it would have been detected by at least
one of these methods.

Fourth, eyewitness testimony research uncovers consistent effects using similar sorts of treatments.
This does not mean that differences should have been found in this study, but it does mean that if
differences existed they should have been detected.

Finally, an analysis of the statistical power of the experiment showed less than a 5% chance that real
differences existed among the groups but were not detected.
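
The kind of power claim made above can be illustrated with a small Monte Carlo sketch. The Python below is not the power analysis reported here; it is a hypothetical illustration using only the standard library. Only the total of 343 students and the five groups come from the study. The group sizes, the assumed effect size (Cohen's f = 0.25), the pattern of group means, and the approximate critical value of 2.40 for F(4, 338) at the .05 level are all assumptions made for the example.

```python
import random
from statistics import fmean

GROUP_SIZES = (72, 68, 68, 68, 67)  # assumed split of the 343 students
F_CRITICAL = 2.40                   # assumed approximate .05 critical value, F(4, 338)


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of samples."""
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    means = [fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def scaled_means(sizes, pattern, effect_f):
    """Scale a pattern of group means so that Cohen's f equals effect_f (sigma = 1)."""
    n_total = sum(sizes)
    weighted_mean = sum(n * m for n, m in zip(sizes, pattern)) / n_total
    raw_f = (sum(n * (m - weighted_mean) ** 2 for n, m in zip(sizes, pattern)) / n_total) ** 0.5
    return [effect_f * (m - weighted_mean) / raw_f for m in pattern]


def estimate_power(effect_f=0.25, replications=1000, seed=0):
    """Share of simulated experiments whose F statistic exceeds F_CRITICAL."""
    rng = random.Random(seed)
    means = scaled_means(GROUP_SIZES, pattern=(-2, -1, 0, 1, 2), effect_f=effect_f)
    rejections = 0
    for _ in range(replications):
        groups = [[rng.gauss(mu, 1.0) for _ in range(n)] for mu, n in zip(means, GROUP_SIZES)]
        if f_statistic(groups) > F_CRITICAL:
            rejections += 1
    return rejections / replications


print(f"Estimated power: {estimate_power():.2f}")
```

Under these assumed inputs the estimate comes out above .90. Smaller assumed effects would give lower power, so any such figure depends heavily on the size of difference one regards as worth detecting.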

This research points to a useful technique for validating multiple-choice tests of informal reasoning.
Eliciting verbal reports of examinees' thinking is a plausible way to gather data on the quality of tests.
This study bolsters confidence in the technique by showing that there is no need to be overly cautious
about the leadingness of questions used to elicit reports of thinking. Examinees' thinking is not altered
by requests to report on their thinking, so the information in the reports is relevant evidence for the
validity of tests. Such evidence can show whether sophistication, background empirical beliefs,
ideologies of reasoners, assumptions reasoners make, and other factors affect performance on
multiple-choice informal reasoning tests.

Collecting verbal reports of thinking on existing multiple-choice informal reasoning tests should
therefore provide important evidence on the validity of those tests. Given the level of suspicion cast on
them by the sorts of criticisms discussed earlier, such evidence is needed. It is important to know, one
way or the other, whether or not existing multiple-choice informal reasoning tests are valid.

The results of such validation studies might be mixed. For instance, whereas many multiple-choice
informal reasoning tests are advertised for wide ranges of audiences, verbal reports of thinking from
subjects across the entire range may indicate that the advertised applicability of a given test should be
narrower. As a consequence, the advertised range of applicability might be altered or, using the
information in the verbal reports of thinking, versions of a given test suitable for more narrowly defined
audiences might be designed. These versions may differ considerably from each other, or may only
differ in keyed responses. It might be possible, for instance, to tailor answer keys to different audiences
to take account of such factors as sophistication, empirical beliefs, ideologies, and so on. As far as I
know, this approach has never been tried with multiple-choice informal reasoning tests, but the
information in verbal reports of examinees' thinking could provide a basis for such trials.

The idea of using verbal reports of thinking to tailor answer keys to different audiences suggests a
developmental (in addition to validation) role for verbal reports. There is no reason to wait until tests
have been developed before using verbal reports of thinking to check their validity. Verbal reports of
thinking on trial items of a test under development can provide evidence for retaining, modifying, or
discarding items. With a systematic procedure for quantifying and using this evidence to judge
individual items and the test as a whole (Norris, 1988), validity can be "built into a test" from the item
level on up. Verbal reports of thinking thus open the prospect of developing valid multiple-choice tests
to do the sorts of informal reasoning assessment for which they are most suited.

However, not all informal reasoning assessment can be served by multiple-choice testing. The Test on
Appraising Observations, used as an example in this chapter, assesses the ability to apply criteria one at
a time to make appraisals of credibility. But in a real-world context of appraising the credibility of a
witness several of the criteria would likely apply at once. Some of the criteria might push the appraisal
in one direction, others in another direction. The criteria would have to be weighed and balanced and
there are no strict rules for doing this. Judgement based upon experience would have to be used.
Multiple-choice tests are not useful for assessing how well people use their judgement to orchestrate a
number of informal reasoning skills to work on ill-defined, real-world problems. Other assessment
methods must be developed.

Informal reasoning dispositions also pass through the mesh of multiple-choice informal reasoning tests,
but reasoning dispositions are as important to assess as reasoning abilities. The assessment of
dispositions is logically a two-stage process, because failure to perform well (e.g., to give alternative
hypotheses when appropriate) could be explained either by lack of knowledge that giving alternatives is
appropriate, lack of ability to generate alternatives, or lack of disposition (given the knowledge) to
provide alternatives. The possibilities of lack of knowledge and ability must be ruled out before lack of
disposition can be accepted as the explanation. Assessment of dispositions is doubly complex and there
are no adequate techniques for assessing dispositions to be open-minded, to seek reasons, to seek
alternatives, to seek critical feedback, and so on. Furthermore, it is not clear at this time how these
assessments might be done best. Essay testing, interviewing individuals, and direct classroom
observation are approaches with promise (Norris & Ennis, in press), but considerable research is
needed.



While the problems of informal reasoning assessment are large, they are surmountable. Many
problems stem from the fact that educators have only recently taken seriously instruction in reasoning.
Assessment practices which are adequate for goals of instruction that focus primarily on learning
factual knowledge are not adequate for assessment of informal reasoning. Therefore, because of the
new-found goal of teaching reasoning, many assessment practices will have to change, some will have to
go, and new practices will have to take their place. This chapter showed why changes are needed and
how changes can be made to multiple-choice testing to make it more suitably meet the goals of
informal reasoning assessment.

Table 1

Frequency of Verbal Moves by Interview Group

                                          Interview Group

                                   Think                 Crit.    Princ.
Verbal Moves                       Aloud    Recall       Probe     Probe

Citing Factual Details               104       139          99       139
Asking Rhetorical Questions           16         9           2         5
Making Evaluations                    45        24          39        43
Constructing Assumptions             178       228         214       227
Attention Control                     26        25          15        19
Interact with Experimenter            19         9          12        13
Pausing                              499       387         424       380